CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation

Paper: arXiv:2405.02355v1
Authors: [Author List]
Published: May 2024

📄 Abstract

This paper introduces CodeGRAG, an efficient, tuning-free framework that enhances Large Language Models (LLMs) in single-round and cross-lingual code generation by injecting structured external knowledge. The core innovation combines Retrieval-Augmented Generation (RAG) with a novel, composite code representation.


💡 Core Problem and Solution

Problem

General LLMs treat code as a sequence of tokens, ignoring its inherent logical and structural complexity (the syntactic gap). This results in missed semantic cues, reduced code quality, and difficulty transferring knowledge across different programming languages (the cross-lingual challenge). Existing methods that rely on multi-prompt solutions are computationally expensive.

Solution: CodeGRAG (Syntax Graph RAG)

CodeGRAG addresses this by extracting a structured representation of each code block, the Composed Syntax Graph, and retrieving the most relevant graph from a knowledge base to augment the LLM's prompt. This gives the LLM access to structural knowledge without any model fine-tuning.


πŸ—οΈ The Methodology​

The framework operates in three stages:

  1. Syntax Knowledge Base Preparation:

    • Code blocks in the retrieval pool are converted into a Composed Syntax Graph.
    • This graph combines the Data Flow Graph (DFG), capturing semantic dependencies, and the Control Flow Graph (CFG), capturing logical execution order, both built upon the Abstract Syntax Tree (AST), plus read-write signals. The combined view provides a purified, high-level representation of the code's inherent logic (see the extraction sketch after this list).
  2. Knowledge Querying:

    • A query is created by concatenating the problem description and the function declaration of the target task.
    • A retrieval function then finds the most relevant graph from the knowledge base. This function uses:
      • A pre-trained text/code encoder (like CodeT5+ or UniXcoder) for the natural language query and raw code.
      • A designed Graph Neural Network (GNN) to encode the graphical views.
    • Contrastive learning aligns the query, raw code, and graph embeddings (Question-Answer and Code-Graph alignment); see the retrieval sketch after this list.
  3. Knowledge Augmented Generation:

    • The retrieved Composed Syntax Graph is formatted and prepended to the LLM's prompt as external knowledge, alongside the original problem statement.
    • The LLM (e.g., GPT-3.5-Turbo) generates the final code from this augmented context in a single call (see the prompt-assembly sketch below).
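To make step 1 concrete, here is a minimal sketch of composed-graph extraction using Python's built-in `ast` module. The paper builds its graphs from compiler-level ASTs of C++ and Python code; the edge types (`ast`, `next`, `dataflow`) and the crude def-use approximation below are illustrative assumptions, not the authors' exact schema.

```python
import ast

def composed_graph(source: str):
    """Toy composed syntax graph: AST structure edges, statement-level
    control-flow edges, and approximate read-write (def-use) edges."""
    tree = ast.parse(source)
    nodes, edges = {}, []

    # AST structure edges (the backbone shared by the CFG and DFG views)
    for parent in ast.walk(tree):
        nodes[id(parent)] = type(parent).__name__
        for child in ast.iter_child_nodes(parent):
            edges.append((id(parent), id(child), "ast"))

    for fn in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
        # Control-flow edges between consecutive top-level statements
        for prev, stmt in zip(fn.body, fn.body[1:]):
            edges.append((id(prev), id(stmt), "next"))
        # Read-write signals: link the last write of a name to later reads.
        # (Crude: ignores branches and loop back-edges; a real DFG would not.)
        last_write = {}
        for stmt in fn.body:
            names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
            for n in names:
                if isinstance(n.ctx, ast.Load) and n.id in last_write:
                    edges.append((last_write[n.id], id(n), "dataflow"))
            for n in names:
                if isinstance(n.ctx, ast.Store):
                    last_write[n.id] = id(n)
    return nodes, edges

nodes, edges = composed_graph(
    "def total(xs):\n"
    "    s = 0\n"
    "    for x in xs:\n"
    "        s = s + x\n"
    "    return s\n"
)
print(len(nodes), "nodes,", len(edges), "edges")
```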
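Step 2 pairs a pre-trained text/code encoder with a GNN over the graph views, trained contrastively. The sketch below, assuming plain PyTorch, shows the shape of that setup: a toy message-passing encoder whose edge-type embeddings stand in for the flow-type features, and an InfoNCE-style loss where the i-th query matches the i-th graph. The real system encodes queries with CodeT5+ or UniXcoder and uses the authors' own GNN design; every name here is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGraphEncoder(nn.Module):
    """One round of message passing with edge-type (flow-type) features,
    mean-pooled into a single graph-level embedding."""
    def __init__(self, dim=128, n_edge_types=3):
        super().__init__()
        self.edge_emb = nn.Embedding(n_edge_types, dim)
        self.msg = nn.Linear(2 * dim, dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, edge_index, edge_type):
        # x: [N, dim] node features; edge_index: [2, E]; edge_type: [E]
        src, dst = edge_index
        m = self.msg(torch.cat([x[src], self.edge_emb(edge_type)], dim=-1))
        agg = torch.zeros_like(x).index_add_(0, dst, m)  # sum messages at dst
        h = F.relu(x + agg)
        return self.out(h.mean(dim=0))                   # graph embedding

def info_nce(q, g, tau=0.07):
    """Contrastive alignment over a batch: i-th query matches i-th graph."""
    q = F.normalize(q, dim=-1)
    g = F.normalize(g, dim=-1)
    logits = q @ g.t() / tau
    return F.cross_entropy(logits, torch.arange(q.size(0)))

# One graph -> one embedding (training would batch many such pairs)
x = torch.randn(6, 128)                    # toy node features
edge_index = torch.randint(0, 6, (2, 10))  # 10 random edges
edge_type = torch.randint(0, 3, (10,))     # flow-type ids
g = TinyGraphEncoder()(x, edge_index, edge_type)
```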
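Step 3 is then just string assembly plus one model call. Below is a sketch of how a retrieved graph might be serialized into the prompt; the wording and edge format are placeholders, not the paper's exact template, and the commented-out generation call assumes the standard OpenAI Python client.

```python
def build_prompt(problem: str, declaration: str, graph_edges) -> str:
    # Serialize edges as "src --flow_type--> dst" lines (placeholder format)
    graph_txt = "\n".join(f"{s} --{t}--> {d}" for s, d, t in graph_edges)
    return (
        "Structural knowledge from a related solution "
        "(composed syntax graph edges):\n"
        f"{graph_txt}\n\n"
        f"Problem:\n{problem}\n\n"
        f"Complete this function:\n{declaration}\n"
    )

# Single-call generation, e.g. with the OpenAI client:
# from openai import OpenAI
# completion = OpenAI().chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user",
#                "content": build_prompt(problem, decl, edges)}],
# )
```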

📈 Key Results

Experiments on the HumanEval-X multilingual benchmark showed significant Pass@1 gains (a reference pass@k estimator follows this list):

  • Performance Gain (RQ1 & RQ2): CodeGRAG consistently outperforms both the base LLM and models augmented with raw code blocks alone (Code-RAG).

    • Example (C++, Pass@1): GPT-3.5-Turbo base 57.93 -> CodeGRAG 62.20
    • Example (Python, cross-lingual, Pass@1): Code-RAG 72.56 -> CodeGRAG 77.44
  • Cross-Lingual Augmentation (RQ3): CodeGRAG enables cross-lingual enhancement, notably improving Python generation when retrieving C++ graphs. This confirms that the Composed Syntax Graph abstracts high-level logic that is shared across different programming languages.

    • Example (Python generation with C++ graph retrieval, Pass@1): GPT-3.5-Turbo base 71.95 -> CodeGRAG 77.44.
  • Structural Components (RQ5): Ablation studies showed that edge features (flow types) matter the most for the model's structural understanding, followed by the topological structure.
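For reference, Pass@1 is the k=1 case of the standard unbiased pass@k estimator of Chen et al. (2021): generate n samples per problem, count the c that pass the unit tests, and average 1 - C(n-c, k) / C(n, k) over problems. A minimal implementation (the estimator is standard; the paper does not spell it out):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # 0.3, i.e. c/n when k == 1
```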


🎯 Conclusion

CodeGRAG successfully validates the principle that structured, graph-based knowledge is highly effective for improving code LLMs. By designing an efficient RAG system around a novel Composed Syntax Graph, the authors provide a tuning-free method that enhances code quality, fluency, and cross-lingual knowledge transfer.